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ABSTRACT. If you want to program a parallel computer, a purely functional language like Haskell 
is a promising starting point. Since the language is pure, it is by-default safe for parallel evalua- 
tion, whereas imperative languages are by-default unsafe. But that doesn't make it easy! Indeed it 
has proved quite difficult to get robust, scalable performance increases through parallel functional 
programming, especially as the number of processors increases. 

A particularly promising and well-studied approach to employing large numbers of processors is 
data parallelism. Blellochs pioneering work on NESL showed that it was possible to combine a 
rather flexible programming model (nested data parallelism) with a fast, scalable execution model 
(flat data parallelism). In this paper we describe Data Parallel Haskell, which embodies nested 
data parallelism in a modern, general-purpose language, implemented in a state-of-the-art compiler, 
GHC. We focus particularly on the vectorisation transformation, which transforms nested to flat data 
parallelism. 

1 Introduction 

Computers are no longer getting faster; instead, we will be offered computers containing 
more and more CPUs, each of which is no faster than the previous generation. As the 
number of CPUs increases, it becomes more and more difficult for a programmer to deal 
with the interactions of large numbers of threads. Moreover, the physical limitations of bus 
bandwidth will mean that memory access times will be increasingly non-uniform (even if 
the address space is shared), and locality of reference will be increasingly important. 

In the world of massively-parallel computing with strong locality requirements there 
is already a well-established, demonstrably successful brand leader, namely data parallelism. 
In a data-parallel computation one performs the same computation on a large collection of 
differing data values. Well-known examples of data-parallel programming environments are 
High Performance Fortran (HPF) [For97], the collective operations of the Message Passing 
Interface (MPI) [GHLL+98], NVIDIA's Compute Unified Device Architecture (CUD A) API 
for graphics processors [NVI07], and Google's map/ reduce framework [DG04]. 

All these systems support only flat data parallelism, in which the computation that is 
performed on each data element must itself be (a) sequential and (b) of a similar execution 
time to the computation on the other data elements. In practice, this severely limits the 
applications of data-parallel computing, especially for sparse or irregular problems [PCS99] . 
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( ! : ) : : [ :a: ] -> Int -> a 

sliceP :: [:a:] -> (Int, Int) -> [:a:] 

replicateP :: Int -> a -> [:a:] 

mapP :: (a->b) -> [:a:] -> [:b:] 

zipP : : [ :a: ] -> [:b: ] -> [ : (a,b) :] 

zipWithP :: (a->b->c) -> [:a:] -> [:b:] -> [:c:] 

filterP :: (a->Bool) -> [:a:] -> [:a:] 

concatP :: [:[:a:]:] -> [:a:] 

concatMapP :: (a -> [:b:]) -> [:a:] -> [:b:] 

unconcatP :: [:[:a:]:] -> [:b:] -> [:[:b:]:] 

transposeP :: [:[:a:]:] -> [:[:a:]:] 

expandP :: [:[:a:]:] -> [:b:] -> [:b:] 

combineP :: [:Bool:] -> [:a:] -> [:a:] -> [:a:] 

splitP :: [:Bool:] -> [:a:] -> ([:a:], [:a:]) 

Figure 1: Type signatures for parallel array operations 



Thus motivated, Blelloch and Sabot developed the idea of nested data parallelism in the early 
90's, and embodied it in their language NESL [BS90]. 

NESL was a seminal breakthrough but, fifteen years later it remains largely un-exploited. 
Our goal is to adopt the key insights of NESL, embody them in a modern, widely-used func- 
tional programming language, namely Haskell, and implement them in a state-of-the-art 
Haskell compiler (GHC). The resulting system, Data Parallel Haskell, will make nested data 
parallelism available to real users. 

Doing so is not straightforward. NESL a first-order language, has very few data types, 
was focused entirely on nested data parallelism, and its implementation is an interpreter. 
Haskell is a higher-order language with an extremely rich type system; it already includes 
several other sorts of parallel execution; and its implementation is a compiler. 

This paper makes two main contributions: 

• We give a tutorial, programmer 's-eye view of what programming in Data Parallel 
Haskell is like. Rather than a series of tiny examples, we give a serious application 
that is very hard to fully parallelise in a flat data-parallel setting, namely the Barnes- 
Hut algorithm for N-body simulation. 

• We give a detailed tutorial overview of the key vectorisation transformation. There are 
two major innovations over NESL: one is the non-parametric representation of arrays 
(Section 4) and one is the treatment of first-class functional values (Section 5). 

All the technical innovations in this paper have appeared, piecemeal, in our earlier 
publications. Our hope, however, is that this paper draws together a somewhat-complex 
set of technical strands into a comprehensible whole. 
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2 The programmer's view on DPH 

We begin by describing Data Parallel Haskell (DPH) purely from the point of view of the 
programmer, illustrating the description with a non-trivial example, the Barnes-Hut algo- 
rithm [BH86]. GHC supports other forms of concurrency besides data parallelism, but we 
focus here exclusively on the latter. Singh [SJ08] gives a tutorial covering a broader scope, 
including semi-implicit parallelism (par), explicit threads, transactional memory, as well as 
Data Parallel Haskell. 

DPH is simply Haskell with the following extra features: 

• A type of parallel arrays, denoted [ : e : ] for arrays of type e. These arrays are indexed 
by values of type Int. From a semantic point of view an array [ : a : ] is very similar 
to a list [ a ] - the difference is in the execution pragmatics. An array can contain 
elements of any type, including arrays and functions. 

• A large number of parallel operations that operate collectively on entire arrays. As far 
as possible, these operations have the same names as Haskell's standard list functions, 
but with the suffix P added — i.e., mapP, f ilterP, unzipP,and so forth. Figure 1 lists 
the operations that we will use in this paper. 

• Syntactic sugar, called parallel array comprehensions, which are similar to list compre- 
hensions but operate on parallel arrays. 

In addition to the parallel evaluation semantics, lists and parallel arrays also differ with 
respect to strictness: more precisely, demand for any element of a parallel array results in 
the evaluation of all elements. 

2.1 N-Body Barnes-Hut Simulation Algorithm 

We will demonstrate the use of DPH features using the Barnes-Hut n-body simulation al- 
gorithm as an example. We discuss the algorithm in some detail because it is a particularly 
striking example of the power of nested data parallelism, and of the utility of user-defined 
data types in data-parallel programs. We will, for the sake of clarity, restrict ourselves to 
two dimensions and neglect complications such as bodies that are very close to each other. 

An n-body simulation computes the motion under gravitational forces of n bodies, or 
particles. A naive solution is to compute the force between every pair of particles which 
requires n 2 calculations in each time step. The Barnes-Hut algorithm reduces the work com- 
plexity to the order of n log n interactions by grouping together particles which are close 
to each other and calculating the centre of gravity, or centroid of the cluster. The centroids 
are then used to approximate the effect the particles have on other particles which are suffi- 
ciently far away. The stricter we are in determining what exactly constitutes "sufficiently far 
away", the more precise the final result is, and the algorithm can be parametrised accord- 
ingly. 

The first phase of the algorithm determines the hierarchical grouping of the particles, 
computes the centroids of the clusters, and stores the result in a tree structure. To be more 
precise, the area is split into four subareas of equal size, the particles are grouped according 
to the subarea they are located in. We repeat this step for each subarea, and terminate if an 
area contains either none or only a single particle. Figure 2 illustrates the tree construction 
process for particles p\. . . P9. In the first iteration, the particles are split into four groups, 
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Figure 3: Rosetree 



depending on which quadrant they are located in. The upper right quadrant already con- 
tains only a single particle, so it isn't divided up any further. Both the upper left and the 
lower right quadrant require only one more iteration, the lower left two iterations. 

Figure 3 shows the resulting hierarchical tree structure: the root node contains the cen- 
troid of all particles and four subtrees (since all subareas contain at least one particle). Each 
of the subtrees contains the centroid of the corresponding subarea - which is the particle 
itself in case of a singular particle. 

The second phase of the algorithm now calculates the forces that affect each particle 
p, by traversing the tree from the root downwards: for every subtree, if the particle p is 
sufficiently far away from the centroid stored in the root of that subtree, the force on p is 
calculated using this centroid without looking at the rest of the tree. Otherwise, we add up 
the forces on p from the subtrees of the current root - and so on recursively. 

2.2 Encoding Barnes-Hut in DPH 

Since the only way to express parallelism in DPH is to apply collective operations to par- 
allel arrays, we need to store all data that we want to process in parallel in such an ar- 
ray. For instance, the function one Step, which computes one step in the simulation, takes 
a parallel array of particles as arguments, and returns an array of the same length, with 
the position and velocity of each particle adjusted according to the gravitational forces: 
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Compute one step of the n-body simulation 
oneStep :: [:Particle:] -> [:Particle:] 
oneStep particles = moveParticles particles forces 
where 

tree = buildTree initialArea particles 

forces = calcForces (lengthOf initialArea) tree particles 



buildTree 

calcForces 

moveParticles 



Area -> [:Particle:] -> Tree 
Float -> Tree -> [:Particle:] -> [: Force: 
[:Particle:] -> [:Force:] -> [:Particle:] 



lengthOf : : Area -> Float 

The function oneStep, as discussed before, is comprised of three data parallel phases. First, 
buildTree decomposes the particles into sub-areas, returning the resulting Tree. This tree 
is then used by calcForces to compute the forces on the particles, returning a new array 
of forces with one element for each particle. Finally, moveParticles uses these forces to 
adjust the positions and velocities the particles. 

The data types involved in the computation are defined exactly as the would be in reg- 
ular Haskell. For example, aParticleisa record of its mass, its location, and its velocity: 

type Vector = (Float, Float) 

type Area = (Vector, Vector) 

type Force = Vector 

type Velocity = Vector 

type Location = Vector 



data Particle 



Particle { mass 

, location 
, velocity 



Float 
Location 
Velocity } 



Some functions are conveniently defined using the parallel array counterparts of ordinary 
list processing functions (see Figure 1). For example, we can define moveParticles like 
this: 

moveParticles :: [:Particle:] -> [ :Force: ] -> [:Particle:] 
moveParticles ps fs = zipWithP moveParticle ps fs 



moveParticle : : Particle -> Force -> Particle 
moveParticle (Particle { mass = m 

, location = loc 

, velocity = vel 

force 

= Particle { mass = m 

, location = loc + vel * 

, velocity = vel + accel * 
where 

accel = force / m 

Now we turn our attention to the Tree data type and its construction. When building and 
traversing a Tree, we want to process its sub-trees in parallel, and so we must use a parallel 
array for the children: 

data Tree = Node Mass Location [:Tree:] 

-- Rose tree for spatial decomposition 



}) 



timeStep 
timeStep } 
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This time, unlike the flat array of particles (which may be very long), the array of sub-trees 
has at most four elements at any level (recall that we are working with only 2 dimensions). 
To build a tree, we perform recursive descent over the area: 

Perform spatial decomposition and build the tree 
buildTree :: Area -> [:Particle:] -> Tree 

buildTree area [: p :] = Node (mass p) (location p) [::] 
buildTree area particles = Node m 1 subtrees 
where 

(m, 1) = calcCentroid subtrees 
subtrees = [ : buildTree a ps 

I a <- splitArea area 

, let ps = [:p | p <- particles, inArea a p:] 
, lengthP ps > 0 : ] 



inArea : : Area -> Particle -> Bool 

inArea ( (lx, ly) , (hx, hy) ) (Particle { location = (x,y) }) 
= lx <= x && x <= hx && ly <= y && y <= hy 

splitArea :: Area -> [:Area:] 

— splitArea returns the four sub-areas in a parallel array 
calcCentroid :: [:Tree:] -> (Mass, Location) 

The first equation deals with the case of a single particle: we simply record its mass and 
location. In the recursive case, the array comprehension for subtrees iterates in parallel 
over (splitArea area), an array of exactly four elements. For each such area a, we 
compute the set of particles ps that lie inside a and, if that set is non-empty, we recursively 
call buildTree. The "if non-empty" test discards sub-areas which do not contain any 
particles at all, so the length of subtrees can be anything between 1 and 4. We omit the 
implementations of inArea and calcCentroid, since they are straightforward. 

The nested comprehension in the where clause of buildTree makes sure that inArea 
is called on every subarea/ particle combination in a single parallel step. Another source of 
nested parallelism in buildTree are the recursive calls to the parallel function buildTree, 
which are performed simultaneously on however many sub-areas contain particles (from 
one to four). The number of parallel steps is hence proportional to the depth of the rose tree. 

Lastly, we have to write the function calcForces, which, given a Tree and an array 
of particles, calculates the forces applied by the Tree on those particles. It can do so by 
dividing the particles into two groups: those that are "far" from the centre of gravity of the 
Tree (as determined by a function isFar), and those that are "near". Here is the code: 

calcForces :: Float -> Tree -> [:Particle:] -> [:Force:] 

calcForces len (Node m 1 ts) ps 
= let 

far_forces = [: forceOn p m 1 | p <- ps, isFar len 1 p :] 

near_ps = [: p | p <- ps, not (isFar len 1 p) :] 

near_f orces_s = [: calcForces (len / 2) t near_ps | t <- ts :] 
near_f orces = [ : sumForces p_f orces 

I p_forces <- transposeP near_f orces_s : ] 

in 

combineP [:isFar len 1 p | p <- ps:] far_f orces near_f orces 
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forceOn : : Particle -> Mass -> Location -> Force 

isFar : : Float -> Location -> Particle -> Bool 

sumForces :: [ :Force: ] -> Force 
The function calcForces divides the particles into two groups: those that are "near" 
the centroid 1 of tree, and those that are not. For the far particles, we simply use f orceOn 
to compute the force on each such particle from the tree, giving f ar_f orces. For particles 
near to 1, near_ps, we recursively use calcForces (in parallel) to compute the force on 
each particle from each sub-tree giving near_f orces_s, a short vector with one element 
for each sub-tree. Each element is a vector with one element for each particle, giving the 
force on that particle from the sub-tree. All that remains is to transpose this nested structure, 
and add up the forces on each particle. Finally, we must re-combine the near and far forces, 
using combineP, which interleaves two vectors as directed by a boolean mask. 

2.3 Communication and locality 

Here is an alternative, simpler way to write calcForces: 

calcForces :: Tree -> [:Particle:] -> [ :Force: ] 
calcForces tree ps = mapP (calc t) ps 
where 

calc (Node m 1 ts) p 

| isFar 1 p = forceOn p m 1 

| otherwise = sumForces [: calc t p | t <- ts :] 
For each particle (the mapP), it recurses down the tree, stopping when the centroid of the 
sub-tree is far away from the particle. 

Which version should we prefer? Different ways of writing the code give rise to dif- 
ferent patterns of data communication. In this latter version you can see that every particle 
needs a copy of (at least the top part of) the tree, so the danger here is that most of the tree 
ends up being copied to most of the processors. In the earlier version, the particles migrate 
(in smaller and smaller groups) to the tree, rather than the other way around. 

It undoubtedly complicates the programmer's life to have to think about these mat- 
ters, but there is no silver bullet. Parallel programming is complicated, and programmers 
must think about concurrency and communication, as well as correctness. However, one 
of the advantages of the data-parallel style is that it gives us a much better handle on the 
program's cost model (both computation and communication) than un-structured parallel 
programming [Ble96]. 

2.4 Summary 

The algorithm we have described makes extensive use of data parallelism. For example, 
buildTree is called in parallel on the four sub-areas of the area under consideration; and 
for each of those sub-areas we compute the relevant subset of the particles in parallel. 
Similarly calcForces is called in parallel on the four sub-trees; and the computation of 
far_f orces is done in parallel over all the particles. In each case, the recursive calls over 
the sub-trees express nested data parallelism, because the computation that is performed 
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on the sub-tree is itself a data-parallel computation. This is really quite difficult to express 
using flat data parallel frameworks; indeed tree construction is often not parallelised. 

The rest of this paper uses the parallel Barnes-Hut algorithm as a running example to 
explain the successive steps through which the program is compiled to run efficiently on 
parallel shared-memory machines. 

3 Compiling DPH programs 

The compiler must translate high-level nested data parallel programs, as described in the 
previous section, into efficient low-level code. This translation consists of four main steps: 

• Desugaring removes syntactic sugar, reducing the program to a simple lambda lan- 
guage. This intermediate language, GHC's "Core" language, is still strongly typed. 

• Vectorisation transforms nested data parallelism into flat data parallelism; it is a Core- 
to-Core transformation. 

• Fusion optimises the Core program, by eliminating redundant synchronisation points 
and intermediate arrays, thus dramatically improves locality of reference; 

• Gang parallelism divides the parallel operations spatially into chunks, each chunk being 
executed by a thread from a gang of threads. Typically a gang contains a thread for 
each CPU. Gang parallelism is expressed by giving library implementations of the 
"vector instructions", rather than by built-in compiler support. 

GHC implements these steps using a large number of Core-to-Core program transforma- 
tions. Many of these transformations have been part of GHC's optimiser for a long time, 
in particular a sophisticated inliner, worker-wrapper unboxing, and constructor specialisa- 
tion [Pey96, PM02, PL91, PTH01]. In the course of the Data Parallel Haskell project, we 
are adding more, array-specific transformations. Due to GHC's generic support for pro- 
gram transformations — specifically, the inliner and rewrite rules [PM02, PTH01] — we 
can implement most of these new transformations as library code, as opposed to extending 
the compiler itself. Indeed, apart from the vectorisation pass, the rest of the optimisation 
pipeline operates in ignorance of the fact that the program being optimised is a data parallel 
one. 

In this paper we focus mainly on vectorisation, starting at Section 3.2, after taking a 
brief diversion to describe how array comprehensions are desugared (Section 3.1). 

3.1 Desugaring array comprehensions 

In the Barnes-Hut code we used both array comprehensions, and ordinary functions over par- 
allel arrays such as zipP and mapP. However, just as in the case of list comprehensions, 
the former is just a convenient syntactic sugar for the latter. More precisely, Figure 4 gives 
rules for desugaring array comprehensions. They are quite standard [JW07], and practically 
identical to those for lists, so we do not discuss them further. These rules are simple, but 
they should be thought of as a specification rather than an implementation, because they 
generate somewhat inefficient code. In GHC's actual implementation we use slightly more 
complicated rules. 
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Expressions e ::= ... | [:e I q : ] 
Qualifiers p, q ::= x<-e \ e \ p , q \ p \q 

V[[[ :e I q: ]] = mapP (A^.e) Q|fa] 

Q\cj\ computes the parallel array of the tuples generated by q 

Q\x<-e\ = e 

Q\e\ = if e then [:():] else [: :] 
Q[[p,q] = concatMapP (Ap^.mapP (\q v .{p v ,q v )) QM)Q[pl 
QKplq] = zipPQMQW 

t/c is a tuple of the variables bound by q 

(x<-e) v = x 

(g)v = () 

(/M)p = (Pl7/<?1») 

(pi?)* = (Pv,qv) 
Figure 4: Desugaring rules for array comprehensions 

3.2 Informal overview of vectorisation 

The purpose of vectorisation is to take a program that uses nested data parallelism, and 
transform it into a program that uses only flat data parallelism. Consider this tiny example 

f : : Float -> Float 
f X = x*x + 1 

For every such function we build its lifted version f l thus: 
f L :: [:Float:] -> [:Float:] 
fL x = (x *l x) +l (replicateP n 1) 
where 

n = lengthP x 
Internally, f l uses "vector instructions" like +l to do its work, where 

+ L :: [: Float:] -> [:Float:] -> [:Float:] 
Notice that it must also replicate the constant 1 so that argument has the type that +l expects. 
So, roughly speaking (we give the true story later), to form the definition of f l we transform 
the body of f in the following way: 

• Replace a constant by a call to replicateP. 

• Replace a function by its lifted versions (e.g. + becomes 

• Replace a parameter (e.g. x) by itself. 

This new definition obeys the equation f l = mapP f , so it takes an array to an array. In effect, 
it is a specialised variant of mapP - specialised by fixing the function argument. The idea 
is that whenever we see the call (mapP f ) we will replace it by fL- But there is a problem! 
Suppose we have 

g :: [:Float:] -> [:Float:] 

g xs = mapP f xs 



Harnessing the Multicores 



First we replace (mapP f)byf L toget: 
g : : [ :Float : ] -> [ :Float : ] 
g xs = fL xs 

But now we must lift g too, in case there are calls to (mapP g). If we try, we get this: 
g L : : [ : [: Float:] : ] -> [ : [ :Float: ] : ] 
gL xs = f LL xs 

Not good: we need the doubly-lifted version of f ! If the depth of nesting is not statically 
bounded (and it isn't in Barnes-Hut) then we are in trouble. Blelloch's clever solution is to 
observe that we can define f ll in terms of f l, thus: 

f LL :: [ : t :Float: ] : ] -> [:[: Float :]: ] 

f ll xss = unconcatP xss (fL (concatP xss) ) 
That is: first concatenate all the rows of xss to make a single flat vector; then map f over 
that vector; then chop up the result to form a vector of vectors again, guided by the original 
shape of x s s . (Note that the incoming vector might well be "ragged", so that not all the sub- 
vectors have the same length.) At first, this idea looks terribly inefficient, because of all the 
flattening and un-flattening but, as we shall see, if we choose the right data representation, 
concatP and unconcatP take constant time and involve no copying. 

This is the core of the vectorisation transformation. We have left many details vague. 
What about higher order functions? What about user-defined data types? We now start 
to tighten our description up. We begin by discussing how to represent arrays (Section 4) 
and functions (Section 5) in vectorised code. These representation choices in turn drive the 
vectorisation transformation (Section 6). More details are given in previous work [KC98, 
CKOO, LCK06, CLP+07]. 

4 Representing arrays in vectorised code 

Standard arrays in Haskell are parametric; i.e., the array representation is independent of 
the type of array elements. This is achieved by using arrays of pointers referring to the 
actual element data. Such a boxed representation is very flexible, but it is also detrimental to 
performance. The indirections consume additional memory, increase memory traffic, and 
decrease locality of memory access. The resulting runtime penalty can be very significant. 

The parallel arrays [ : a : ] offered by the DPH source language are also parametric, 
as can be seen from the polymorphic type signatures in Figure 1. One of the tasks of the 
vectoriser is to change the array representation, by systematically transforming a function 
that manipulates values of type [ : (Int, Int) : ], say, to one that manipulates values of 
type PA ( Int , Int ) . These new PA arrays have a non-parametric representation; that is, the 
representation depends on the element type [CKOO]. For example, a value of type PA Int is held 
as a contiguous memory area containing unboxed 32-bit integer values — not as a block of 
pointers to Int-valued thunks, as is the case in vanilla Haskell. 

Although PA is not visible to the user, such non-parametric data types are an inde- 
penently-useful source-language feature, already implemented in GHC, which we call an 
associated data type [CKPM05]. We will therefore explain PA using the notation of associated 
data types. 
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Since the representation of an array depends on the type of its elements, there can be 
no useful polymorphic functions over PA. For example, we cannot define 

lengthPA :: PA a -> Int — WRONG! 
because lengthPA, being polymorphic in a, knows nothing about the representation of 
PA a. This is just what type classes are for. So we declare the type PA in association with a 
class PAElem that defines operations over the type, thus: 
class PAElem a where 
data PA a 

indexPA : : PA a -> Int -> a 

lengthPA : : PA a -> Int 
replicatePA : : Int -> a -> PA a 
. . .more operations. . . 
Given a type a that is allowed to be an element of a parallel array, there is a corresponding 
data type PA a, and operations indexPA, lengthPA, replicatePA, and so on. These 
operations therefore have overloaded types, thus: 

indexPA : : PAElem a => PA a -> Int -> a 
lengthPA : : PAElem a => PA a -> Int 
. . . etc . . . 

All the parametric operations of Figure 1 have PA variants with the same types apart from 
the additional (PAElem a) constraint. (Our concrete implementation is more complex 
with more operations, but the code shown here conveys the basic idea.) 

An instance declaration fills in an implementation for each of these elements. For ex- 
ample, the instance declaration for integers takes the following form: 
class PAElem Int where 

data PA Int = AInt ByteArray 

indexPA (AInt ba) i = indexIntArray ba i 

lengthPA (AInt ba) = lengthlntArray ba 

replicatePA n i = AInt ( replicatelntArray n i) 

. . .more operations. . . 
We represent the array by a contiguous region of bytes (aka ByteArray) with primitives 
such as indexIntArray that operate on individual 32-bit integers from a ByteArray. 
(The code again simplifies the concrete implementation by omitting the use of unboxed 
types.) 

4.1 Arrays of structured data 

The PAElem instance for Float, and other primitive types, follows the same pattern. But 
what about more complex data structures, such as an array of pairs? It is quite unacceptable 
to represent it by an array of pointers to (heap-allocated) records, because the indirection 
costs would be too heavy. Instead, we represent it by a pair of arrays: 
class (PAElem a, PAElem b) => PAElem (a, b) where 
data PA (a,b) = ATup 2 Int (PA a) (PAb) 

indexPA (ATup2 _ arrl arr2) i = (indexPA arrl i, indexPA arr2 iv) 
lengthPA (ATup2 n ) = n 
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Thus, a PA (Float, Float) is represented by a pair of unboxed arrays, each storing a 
vector of floating point values. Crucially, the two arrays must have the same length; and we 
record that length in the Int field of the ATup2 constructor. This length field is convenient, 
but usually redundant — but not always! Consider an array of ( ) elements: 
class PAElem () where 
data PA () = ATup 0 Int 
indexPA (ATupo _) i = 0 
lengthPA (ATupo n) = n 
We need no data storage to store a vector of ( ) values, but we must still remember its length. 

Notice that the representation is compositional; that is, the representation of an array of 
pairs is given by combining the representations of an array of the first and second elements 
of the pair, and so on recursively. 

The representation also allows us to combine two arrays element-wise into an array of 
pairs in constant time, with unzipping being equally easy: 

zipPA :: PAElem a => PA a -> PA b -> PA (a,b) 
zipPA as bs = ATup2 (lengthPA as) as bs 

unzipPA :: PA (a,b) -> (PA a, PA b) 

unzipPA (ATup_2 _ as bs) = (as,bs) 
This stands in contrast to lists, where zipping and unzipping take linear time. 

Lastly, since records are converted into product types by the desugarer, the Particle 
arrays in Barnes-Hut are represented by tuples of arrays. 

4.2 Nested arrays 

Even more interesting is the representation of nested arrays. A classic example is that of 
sparse matrices, in which we represent a sparse matrix as a vector of rows, each row con- 
sisting of a vector of (index, value) pairs, where only the non-zero values in the row are 
represented. Thus 

type SparseMatrix a = [ : [ : (Int, a) : ] : ] 
Since our ultimate goal is to eliminate nested parallelism, it is not surprising that we also 
want to represent nested arrays in terms of flat ones. Indeed, a nested array PA ( P A a ) can 
be encoded by 

• a flat data array of type PA a which contains the data elements and 

• a segment descriptor of type PA (Int, Int) which stores the starting position and 
length of the subarrays embedded in the flat data array. 

This is captured by the following instance: 

class PAElem a => PAElem (PA a) where 

data PA (PA a) = AArr (PA a) (PA (Int, Int)) 

indexPA (AArr arr segd) i = slicePA arr (indexPA segd i) 

lengthPA (AArr _ seg) = lengthPA seg 

where sliceP extracts a subarray from a larger array in constant time. Thus, the sparse 

matrix 

[: [: (0,15), (2,9), (3,20) :], [::], [: (3,46) :] :] 
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Figure 5: Value of type [ : Tree : ] and its vectorised representation 



will be represented as 

AArr (ATup 2 [#0,2,3,3#] [#15, 9, 20, 46#] ) —Data 

(ATup 2 [#0,3,3#] [#3,0, 1#]) — Segment descriptor 

where we write [#...#] for a literal ByteArray. The first ByteArray contains all the 
column indexes, the second one all the Floats, and the third and fourth the start indexes 
and lengths of the segments, respectively. Since all four ByteArray are unboxed, programs 
which process such matrices can be compiled to highly efficient code. 

Remarkably, we can now give constant-time implementations of the two functions 
concatPA and unconcatPA, as promised in Section 3.2: 

concatPA :: PA (PA a) -> PA a 

concatPA (AArr cts _) = cts 

unconcatPA :: PA (PA a) -> PA b -> PA (PAb) 
unconcatPA (AArr _ shape) cts = AArr cts shape 

4.3 Recursive types 

If the array elements are recursive, the non-parametric representation of the array has to be 
recursive, too. For instance, arrays of Tree from Section 2.2 are represented as follows: 
instance PAElem Tree where 

data PA Tree = ATree Int (PA Mass) (PA Location) (PA (PA Tree)) 
Since Tree is a product, the representation is similar to arrays of tuples. It stores the masses 
and locations of the centroids and a nested array containing the subtrees of each node. As 
described in the previous section, the latter is encoded by a flat array of trees together with 
a segment descriptor. In effect, this means that an array of trees is represented by a list 
with each element containing the centroids and segmentation information for one tree level. 
This allows all data in one level to be processed in parallel, although the levels have to be 
processed one after another. Figure 5 illustrates this representation. User-defined types are 
discussed in more detail in Section 6.4. 

4.4 Polymorphism 

If we were only interested in monomorphic code, or if we would use a whole-program com- 
piler that specialises a polymorphic to a monomorphic program, as was the case in NESL, 
then life would have been much easier. We could implement the non-parametric array type 
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by statically replacing PA Int by PAlnt, say where the latter is a perfectly ordinary data 
type, defined as 

data PAlnt = AInt ByteArray 
Now we do not need non-parametric types; in the vectorised code, original types [ : Int : ] , 
[: (Int, Float) :], etc, are simply replaced by PAlnt, PAPair PAlnt PAFloat, and so 
on, where all these are ordinary data types. 

Alas, this does not work for polymorphic functions. For example, how could we translate 
the type of this function? What would we statically replace [ : a : ] by? 

firstRow :: [:[:a:]:] -> [:a:] 
We also cannot turn polymorphic functions into families of monomorphic functions, as we 
support separate compilation and polymorphic recursion. No — if we want polymorphism, 
we must use something akin to type classes, as we have described in this section. A key 
component of our work is the extension of the non-parametric representation idea to work in 
a polymorphic setting. In particular, our Core language regards PA as a type-level function 
from types to types [SCPD07]. 

5 Representing functions in vectorised code 

Haskell is a higher order language, so we have to consider how to vectorise programs that 
manipulate functions. Vectorising higher-order programs raises two distinct problems. 

5.1 Functions are pairs 

Consider this (contrived) definition: 

ho :: (Int->Bool) -> (Bool, [:Bool:]) 

ho f = (f 2, mapP f [:1,2,3:]) 
In our overview (Section 3.2), we said that we should replace (mapP f ) with a call to f^, 
the lifted version of f . But since f is lambda-bound, it is not so easy to call "the lifted version 
of f ". Clearly the caller must pass the lifted version of f as a parameter to ho. But there 
is an ordinary, scalar call ( f x) in the body of ho, so we can't pass only the lifted version. 
The obvious alternative is to pass a pair that gives both the lifted and unlifted versions. With 
such a representation, mapP can just extract the lifted variant of its argument (a pair), while 
the vanilla application of f extracts the unlifted variant. 

5.2 Functions are closures 

There is a second challenge. In Section 4 we discussed the efficient, non-parametric repre- 
sentation of data-parallel arrays. A higher order language forces us to confront the question 
of how to represent an array of functions. For example: 

distance :: [:Float:] -> [: Float->Float :] 
distance xs = mapP (Ax y. sqrt (x*x + y*y) ) xs 

distY :: [:Float:] -> Float -> Float 

distY xs y = sumP [ : d y | d <- distance xs : ] 
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There are more direct ways to write distY, of course, but arrays of functions also arise 
inevitably when we think about lifting. If we start with 

f : : (Int->Int) -> Int 
then the lifted version of f has the type 

f L : : [: Int->Int :] -> [ : Int : ] 
How should we represent such an array of functions? A possible answer is "as an array 
of pointers to function closures". Bad answer! In data-parallel computations, all the pro- 
cessors are supposed to execute the same code in parallel, but if every function closure in 
the array is potentially different, they clearly cannot do that. Furthermore, a pointer-based 
representation destroys locality. 

Happily, there is no need for the generality of a distinct function pointer for each array 
element. Consider the result of distance, for example. Every element of this array is 
a function with the same code, but a different value for the function's free variable x. This 
makes the solution obvious: we must represent an array of functions by a pair of a single 
code pointer and an array of environment records, which give the per-element bindings for 
the free variables. 

5.3 Putting it together 

Putting our two solutions together, we see that in vectorised code a function must be repre- 
sented by a triple: 

1. The scalar version of the function 

2. The lifted version of the function 

3. An environment record of the free variables of the function 

To be concrete, here is the data type declaration for vectorised functions: 

data (a :-> b) = forall e. PAElem e => 

Clo { env : : e 

, clo s : : e -> a -> b 
, clo/ :: PA e -> PA a -> PA b } 
This declaration says that ( : -> ) is an algebraic data type (written infix), with a single con- 
structor Clo. The constructor Clo has an existentially-quantified type variable e, and three 
fields, env, clo s , and clo/. The vectorisation transformation will transform every function 
type Ti->T2 to t[ : ->x' v where r{ is the transformed version of Ti and similarly for T2. In 
effect, the vectorisation transform performs closure conversion [AJ89]. 

With this definition in hand, we can now explain how arrays of values of type (a : - > b ) 
are represented: 

instance PAElem (a : -> b) where 

data PA (a :-> b) = forall e. PAElem e 

=> AClo { aenv : : PA e 

, aclos : : e -> a -> b 
, aclo; :: PA e -> PA a -> PA b } 
lengthPA (AClo env f s f/) = lengthPA env 
indexPA (AClo env f s f;) n = Clo (indexPA env n) f s f/ 
replicatePA n (Clo env f s f/) = AClo (replicatePA n env) f s f/ 
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To represent an array of functions, we keep a single code pointer for each of the scalar and 
lifted code, but have an array of environment records. Notice that Clo and AClo differ only 
in the type of the environment field; their clo s and clo/ fields are identical. 

As in the case of other types, it is worth noting that this representation supports very 
simple and direct implementations of indexing, replication, and so on. It does not efficiently 
support literal arrays of various different functions, such as [:sin,cos:]. This is quite 
deliberate: in a data-parallel computation all the processors should be performing the same 
computation at the same time. Nevertheless, such arrays can be handled, essentially using 
conditionals which ensure that different functions are executed one after another. 

In Section 6.2, we will see how Clo and AClo are used in the transformation of the ho 
example. Before we can do so, we must first specify the vectorisation transformation more 
precisely. 

6 Vectorisation 

We are finally ready to discuss the vectorisation transformation itself. Consider a top-level 
function definition f :: r = e, where r is the type off. The full vectorisation transformation 
produces a definition for the vectorised version of / called /v, thus: 

/v"V,[t]=VH 

Here, fy is the fully vectorised variant of /, whose right-hand side is generated by the full 
vectorisation transform V [•]. As we have already seen, vectorisation returns an expression 
of a different type to the input, so the type of fy is obtained by vectorising the type t, 
thus Vj[t]. In general, if e :: t then V [e] :: Vf[rJ. Figure 6 gives the functions for both 
type and term vectorisation. In our compiler, the transformation applies to an explicitly- 
typed program, but we omit all type information in Figure 6, in order to concentrate on the 
essentials. 

Vectorisation is applied separately to each top level function in the program, so it is a 
whole-program transformation. In real programs, only a part will be data-parallel, while 
much of it is not (e.g. input/output, user interaction etc). We ignore this issue here, but in 
reality our compiler performs selective vectorisation - see Section 6.5 and [CLJK08]. 

The type transformation Vt [t] transforms a source-program type to the corresponding 
type in the vectorised program. As can be seen in Figure 6, its effect is simple: it transforms 
every function arrow (->) to a vectorised function arrow (:->), and every parametric 
array constructor [ : : ] to a non-parametric parallel array constructor PA. A user-defined 
algebraic data type might have nested uses of (->) or [ : : ] — for example, Tree does so 
— and for these we must generate a vectorised variant (Treey) of the data type itself. We 
elaborate this point in Section 6.4. 

This type transformation forces the vectorised program to differ quite radically from 
the original. In particular, since a "function" is now a triple constructed with Clo, we need 
an infix application operator $ : to extract the scalar copy: 
($: ) : : (a :-> b) -> a -> b 
($:) (Clo env fs fl) = fs env 
and a lifted version 
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V f [r] :: Type 

V f [T 1 ->T 2 ] 
V f [[:r:]] 
V t [lnt] 
V f [Float] 
V f [TTi...T„] 



Type is the vectorisation transformation on types 

V t [ii] :->V f [T 2 ] Functions 

Parallel arrays 



Primitive scalar types 



Int 
Float 

T v V t [ri] . . . Vf[r n ] Algebraic data types (e.g. lists) 



£ t [r] = PAVtlrj 



V [e] :: Term — » Term is the full vectorisation transformation on terms 
Invariant: if x, : <7, h e : t then x; : V t \cTi\ h V [e] : Vt[r] 

V [)t] = )t fc is a literal 

If I = /v /is bound at top level 

V |x] = x x is locally bound (lambda, let, etc) 

V\e l e 2 \ =V{e 1 j $: V [e 2 ] 

V [Ax.e] = Clo {env = (y lr ...,y k ) 

, clo s = Aex. case e of (yi, . . . ,y^) — » V [e] 
, clo/ = Aex. case e of ATupjt n y\ . . . y k — » £ [e] n} 
where{yi, . . . ,yfc} = free variables of Ax.e 

if ei 

= if Vfci] then V [e 2 ] else V [e 3 ] 



V 



then e 2 
e 1 s e e 3 



£ 



£ [e] n :: Term — > Term — > Term is the lifting transformation on terms 
Invariant: if x, : W[ h e : t then x,- : £f[cr,J h £ [e] n : £([t] 
where n is the length of the result array 



cm n 
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= replicatePA n k 
= replicatePA n fy 
= x 

= £[ei] n $: L £ [e 2 ] n 
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Figure 6: The vectorisation transformation 
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($: L ) :: PA (a:->b) -> PA a -> PA b 
($:l) (AClo env fs fl) = fl env 

The transformation rules in Figure 6 are given with their type invariants, which make 
the rules much more comprehensible. For example, consider the rule for V \e\ 62]- Since 
e\ : T\->Tz, we know that V [ei] : Vf[ii] : ->Vf [T2]; that is why we need the application 
function ($: ) to transform the application to an expression of type V/ [T2]. 

Similarly, the rule for V [Ax.e] must produce a value of type V/ [ti] : ->Vt [T2], and that 
in turn must be built with a Clo constructor. We build an environment tuple (yi, . . . ,i/jt), 
of the free variables of (Ax.e). Now the type of the arguments of Clo tell us what functions 
we must build. The scalar function simply requires a recursive use of V [e], while the lifted 
function requires us to generate a lifted version of the code for e, C\e\ n. 

The rules for C [e] n can be read in the same way. The main new complication is with 
conditionals. First we compute in parallel e' v the vector of booleans (of length n) for the 
discriminant of the conditional. Then we use that vector to split the a vector of environment 
tuples into two parts, ysz (for which corresponding elements of e[ is true), and 1/S3 (for which 
e[ is false). The lengths n^, n^ of these vectors will sum to n. Then we compute each of e' 2 
and e' 3 in parallel, and finally interleave them together with combinePA. 

Why do we need to pack and split the free variables in the conditional rule? Each free 
variable y, is bound to an n-vector; but in the then branch we need a (shorter) ^-vector 
(namely ys2) of the elements of y* for which e is True; and dually for the else branch. We 
must also test for n^ or n$ being zero (done by CJ \e\ n), otherwise when transforming a re- 
cursive function we would generate a program that recurses infinitely deep. The operational 
behaviour of the translated function will compute e[, e' 2 and e' 3 in sequence; as in any data- 
parallel machine, the "then" and "else" branches of a conditional are computed separately. 

Figure 6 is the core of this paper. Our real system handles let expressions, case ex- 
pressions, and constructors, and hence is a bit more complicated. But Figure 6 describes all 
the essential ideas. 

Since V invokes both V and L (as does C) you might worry about a code explosion. 
But notice that the clo s field in V is identical to the aclo s field in C, and both are closed 
functions that can be named, and bound at top level; and similarly for the clo/ and aclo/ 
fields. Hence, as we will see in the examples that follow, we can avoid the code explosion 
simply by naming and sharing these functions. 

6.1 A simple example 

Here is the simplest possible example: 

inc : : Float -> Float 
inc = Ax. x + 1 
The full vectorisation transformation in Figure 6 gives us this: 

incy Float : -> Float 
incy = Clo () incs incL 

incs :: 0 -> Float -> Float 

incg = Ae x. case e of () -> (+)v $: x $: 1 



Peyton Jones, Leshchinskiy, Keller, Chakravarty 



FSTTCS 2008 



incL :: PA ( ) -> PA Float -> PA Float 

incL = Ae x. case e of ATupo n -> (+)v $:l x $:l (replicatePA n 1) 
To aid explanation we have named incs and incL, but otherwise we have simply applied 
Figure 6 blindly. Notice the way we have systematically transformed inc's type, replacing 
( - > ) by (:->). Notice too that this transformation neatly embodies the idea that we need 
two versions of every top-level function inc, a scalar version incs and a lifted version incL- 
These two versions paired together to form the fully vectorised version incy- 

The vectorised code makes use of vectorised addition ( + ) , which is part of a fixed, 
hand-written library of vectorised primitives: 

(+)v — Float : -> Float : -> Float 
(+)v = Clo () (+) S (+)l 

(+) s :: 0 "> Float -> Float : -> Float 
(+)s = Ae x . Clo x addFloats addFloatL 

( + ) L PA () -> PAFloat -> PA (Float : ->Float ) 
(+)l = Ae xs . AClo xs addFloats addFloatL 

-- Implemented in the back end 

addFloats : : Float -> Float -> Float 

addFloatL :: PAFloat -> PAFloat -> PAFloat 
The intermediate functions ( + ) g and ( + ) l deal with partial applications of ( + ) . Finally we 
reach ground truth: invocations of addFloats and addFloatL, which are implemented 
by the back end. The former is the ordinary floating point addition instruction; the latter 
is a "vector instruction", which will be implemented differently on different targets. On a 
sequential machine it will be implemented as a loop; on a GPU it will be implemented using 
vector hardware; on a cluster it will be implemented using a loop on each CPU with barrier 
synchronisation at the end. Section 7 elaborates. 

These functions look grotesquely inefficient, especially considering how trivial the orig- 
inal function inc was. Fortunately, most of the clutter is introduced to account for the possi- 
bility of higher order programming, and can be removed by straightforward optimisations. 

For example, consider the sub-term ( + ) v $: x $: 1 in the definition of incs- We can 
simplify it in the following way: 

( + ) v $: x $: 1) 
^ Inline ( + )v 

(Clo () ( + ) s ( + ) L ) $: x $: 1 
^ Definition of $: 
( + )S 0 x $: 1 
^ Inline ( + ) s 

(Clo x addFloatg addFloat L ) $: 1 

^ Definition of $: 

addFloats x 1 
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All the intermediate closure data structures are removed. (To save generating huge interme- 
diates during compilation, we are exploring whether the vectorisation transformation could 
have special cases to avoid introducing them in the first place.) 

6.2 The higher order example again 

It is instructive to see a case where the use of higher order functions prevents complete 
removal of intermediate closures. Let us return to the ho example of Section 5.1: 

ho :: (Int->Bool) -> (Bool, [:Bool:]) 
ho f = (f 2, mapP f [ : 1, 2, 3 : ] ) 
Again applying the vectorisation transformation blindly we get this: 

ho v :: (Int : -> Bool) : -> (Bool, PABool) 
hoy = Clo () hos hoL 

hos :: 0 -> (Int : -> Bool) -> (Bool, PABool) 
ho s 0 f = (f $: 2, mapPy $: f $: [:1,2,3:]) 

ho L :: PA () -> PA (Int : -> Bool) -> PA (Bool, PABool) 
ho L (ATup_0 n) fs 

= (,) L (fs $:l replicatePA n 2) 

(replicatePA n mapPy $:l fs $:l replicatePA n [:1,2,3:]) 
We have taken a short-cut here by using optimised transformation rules for pairs: 

V\{e ll e 1 )\ = (VW,VH) 
C[(e lt e 2 )j n = (,) L n (£ n) (C{e 2 j n) 

The reader may verify the correctness of this optimised rule by seeing what happens instead 
if we use the normal translation ( , ) y $: V\e\\ $: V \e\\, and the definition of ( , ) y, which 
in turn is very like that for ( + ) . Because of our array representation, the lifted pairing 
function ( , ) L is a constant-time operation: 

(,) L :: Int -> PA a -> PA b -> PA (a,b) 

(, )l n xs ys = ATup2 n xs ys 

6.3 How flattening happens 

In our informal overview (Section 3.2) we said that we "replace a call (mapP f) by fi/'. 
Higher order flattening takes that static decision and makes it dynamic, by representing f by 
a pair of functions, thereby allowing mapP to select at runtime. (With the usual compile-time 
optimisations when f is known, of course.) The code for mapP itself is therefore the heart of 
the way in which nested data parallelism is transformed to flat data parallelism. Here it is: 

mapPy :: (a :-> b) : -> PA a : -> PA b 

mapPy = Clo () mapPi mapP2 



mapPi :: () -> (a :-> b) -> PA a : -> PA b 
mapPi _ f = Clo f mapP$ mapPL 
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mapP 2 :: PA ( ) -> PA (a : -> b) -> PA (PA a : -> PAb) 
mapP2 _ fs = AClo fs mapPs mapPL 

mapPs :: (a :-> b) -> PA a -> PAb 
mapPg (Clo env fs fl) xss 

= fl (replicatePA (lengthPA xss) env) xss 

mapP L :: PA (a : -> b) -> PA (PA a) -> PA (PAb) 
mapPL (AClo env _ fl) xss 

= unconcatPA xss (fl (expandPA xss env) (concatPA xss)) 

— xss : : PA (PA a) 
-- env : : PA e 

— fl : : PA e -> PA a -> PAb 

In mapPs we exploit the key observation from Section 3.2, namely that we can define the 
doubly-lifted function using the singly-lifted one f 1, using constant-time reshaping opera- 
tions on the data. Unfortunately, to account for free variables, we face a small complication: 
the environment env contains one element for each subarray of xss. Thus, before apply- 
ing f 1 we must expand env, i.e., repeat each element as many times as the corresponding 
subarray of xss has elements. For top-level functions, the environment will be empty and 
expandPA performs no work. 

Of course, mapP is not the only function that the library must implement. All of (the 
PA versions of) the functions in Figure 1 must be provided in vectorised form. For example, 
here is the lifted version of zipP (the definition of zipPA is given in Section 4.1): 
zipP L :: PA(PAa) -> PA (PAb) -> PA(PAa,b) 

zipP^ (AArr segd xs) (AArr _ ys) = AArr segd (zipPA xs ys) 
These library functions are the heart of flattening: they make nested data parallelism "go". 
Everything is organised to make their implementation, especially their lifted variants, work 
efficiently. 

6.4 User-defined data types 

One of Haskell's strengths is the ease with which programmers can declare new algebraic 
data types, and process them using pattern matching. DPH allows all of this expressiveness 
in fully-vectorised code as well. There are two main complications: occurrences of (->) 
and [ : : ] in user-defined data types; and representing arrays of values drawn from such 
types. We discuss each in turn. 

Vectorising user-defined data types 

In Figure 6, the type transform Vf [r] replaces a user-defined data type T by its vectorised 
counterpart Tv. But what exactly is Ty? Consider 

data Fun = MkFun (Int -> Int) 
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Remember that in the vectorised program, each function arrow (->) must be replaced by 
a function closure ( : -> ) — and of course that must also happen inside data types. So we 
must generate a vectorised version of Fun, thus: 

data Funy = MkFuny (Int : -> Int) 
This must be done recursively: if a constructor of data type T mentions Fun, then T too must 
have a vectorised version. So the vectorised variant of each data type obtained by simply 
applying the Vf[] transform to every type in the data type declaration. The Tree type of 
Section 2.2 is another good example, because we must replace [ : : ] by PA: 

data Treey = Nodey Mass Location (PA Tree) 
While we can generate a vectorised version of every data type, it is unnecessary to do 
so for data types that do not mention functions or parallel arrays. Happily, almost all data 
types fall into this category; for example Bool, Maybe, lists, tuples, and so on. We quietly 
took advantage of this in the Tree example, by not transforming Mass to Massy (and simi- 
larly Location) because Mass = Massy. In Section 6.5 we will see a second reason to avoid 
vectorising a data type unless it is absolutely necessary to do so. 

Arrays of user-defined data types 

The ideas of Section 4.1 can readily be extended to work for arbitrary user-defined algebraic 
data types. We have already seen how this works for Tree in Section 4.3. Here is another 
example, a sum type: 

data Maybe a = Nothing | Just a 
How can we represent an array of Maybe Float values? The natural dense representation 
is as a pair of (a) an array of booleans (True for Nothing, and False for Just), and (b) an 
array of Float containing only the Just values: 

instance PAElem a => PAElem (Maybe a) where 
data PA Maybe a = AMaybe (PA Bool) (PA a) 
indexPA (AMaybe bs vs) i 



lengthPA (AMaybe bs _) = lengthPA bs 
In practice, to avoid computing j u st_indi ce s on each indexing operation we precompute 
the index vector, and cache it in an extra field of the AMaybe constructor. 

In our real implementation, we avoid generating a big instance declaration for every 
such user-defined data type, by instead generating code to convert it to a simple sum-of- 
products representation, and then using a set of fixed instances for PAElem at those repre- 
sentation types. 

6.5 What we have swept under the carpet 

Vectorisation is a complicated transformation, and to keep it comprehensible we have sim- 
plified several aspects. In this section we briefly mention some of them. 



I indexPA bs i 
I otherwise 
where 



= Nothing 

= indexPA vs (indexPA just-indices i) 



just_indices 



= scanPA (+) 0 (mapPA boolToInt bs) 
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Types and dictionaries 

The alert and Haskell-savvy reader will have noticed the following discontinuity in our 
presentation. We described PA type in association with a type class, PAElem. However, type 
classes are dealt with by the type inference system, right at the front end of the compiler, 
and are completely translated out in the passage to the Core intermediate language. In this 
desugaring, a function with an overloaded type, such as nub :: Eq a => [a] -> [a] 
is given a second parameter which is a record, or "dictionary", of the functions that imple- 
ment the operations of the Eq class. 

In the desugared program, nub has type EqD a -> [a] -> [a] , where EqD is an 
ordinary data type, thus: 

data EqD a = EqD { (==) : : a -> a -> Bool 

, (/=) : : a -> a -> Bool } 
Correspondingly, the desugarer injects an extra argument at every call to nub, namely the 
correct method suite for that particular call site. 

The vectoriser generates many calls to replicatePA, splitPA, etc, which have type- 
class-constrained types, yet the vectoriser runs after typechecking and desugaring are complete. So 
the vectoriser cannot take advantage of the implicit injection of extra arguments; instead it 
must insert them itself. In the real implementation of Figure 6, the vectoriser therefore adds 
appropriate dictionary abstractions and applications. (In fact, since GHC's Core language 
is an explicitly-typed variant of System F, we also inject type abstractions and applications.) 
All this is tiresome but routine; showing the implicit abstractions and applications in Fig- 
ure 6 would have dramatically obfuscated an already-dense figure. 

Selective vectorisation 

As mentioned earlier, we do not really vectorise the whole program; rather, we selectively 
vectorise parts of it. We must also generate marshaling code to allow us to "cross the bor- 
der" between vectorised and unvectorised code. For example, in Barnes-Hut, we presum- 
ably want to vectorise the one Step function, which will give us 

oneStepv PA Particle : -> PA Particle 
If we want to be able to call oneStep from ordinary scalar code, we must generate the 
following marshalling code: 

oneStep :: [:Particle:] -> [:Particle:] 
oneStep ps = fromPA (oneStepv $: (toPA ps) ) 

toPA :: PAElem a => [:a:] -> PA a 

fromPA :: PAElem a => PA a -> [:a:] 
Marshaling may also be necessary for user-defined data types.. For example, suppose we 
vectorise a function f :: Int -> Fun, so that fy :: Int :-> Funy (cf. Section 6.4 for 
the definition of Fun). If we want to call f from normal scalar code, we must generate: 

f : : Int -> Fun 

f n = case fy n of 

MkFuny tf -> MkFun (($:) tf) 
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Of course, it gets worse if the data type is recursive, because the marshaling code has to 
traverse the whole structure. On the other hand, no marshaling is needed for types that 
have have no functions or arrays inside them, which is a strong reason for exploiting that 
special case (Section 6.4). 

Marshaling has a run-time cost. In particular, the calls to toPA and f romPA change 
the data representation for parallel arrays, and so are potentially very expensive. In fact, it 
is possible to choose a representation for [ : a : ] that mitigates these costs somewhat but in 
general, marshaling data across the border should be avoided. 

The question of just which parts of the program to vectorise is therefore an interesting 
one. We want to vectorise code that can run in parallel; we want to reduce marshaling to 
a minimum; and we do not want to vectorise code where there is little or no benefit. We 
suggest automatic approaches in [CLJK08], but it may also be reasonable to seek help from 
the programmer (e.g. "vectorise module X but not module Y"). 

Laziness 

Consider this function: 

f : : Int -> Int 
f x = h x (1/x) 

Although x might be zero, let us assume that h only evaluates its second argument if its 
first argument is non-zero. Haskell's lazy evaluation therefore ensures that no divide-by- 
zero exception is raised. 

The lifted version will look something like this: 

f L : : PA Int -> PA Int -> PA Int 

fL xs = hL xs (replicatePA (lengthPA xs) 1 /l xs) 
The trouble is that a demand for any element of hi/s second argument will force all the 
elements to be evaluated, including the divisions by zero. Something very similar arises in 
a more local context when we have let expressions: 

f x = let y = 1/x in 

if x==0 then 0 else y+1 
Although this is something of a corner case, we do not yet have a very satisfying solution. 
We currently simply ignore the problem, and accept the slight change in semantics. A better 
solution might be to reify the exception into an exceptional value (like a IEEE NaN); but 
that carries an efficiency cost. Lastly, we might treat the argument as a miliary function, 
accepting the loss of sharing that would result. 

7 Multicore execution model 

The vectorisation transformation turns all nested data parallelism into parallel operations 
on flat arrays, as used by the instances of the PAElem class. The transformation is crucial to 
express parallel algorithms on a high-level of abstraction and in a modular fashion. How- 
ever, purely functional array operations, even if restricted to flat arrays, are still a far cry 
from the hardware model of multicore CPUs. 
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In particular, the vectorised code uses many superfluous intermediate arrays, which 
increase the overhead of memory management and whose creation involves extra synchro- 
nisation between parallel CPUs. Even worse, the repeated traversal of large structures com- 
promises locality of reference, and so, has a very negative effect on execution performance. 
Finally, we need to map the data parallel array operations onto the multi-threaded execu- 
tion model of multicore CPUs by way of the Single Program Multiple Data (SPMD) model 
[DarOl]. 

In contrast to the vectorisation transformation, we can implement the mapping from 
flat data-parallel code to SPMD code using existing transformation and optimisation phases 
of GHC; in particular, we make heavy use of GHC's inliner and rewrite rules [PM02, PTH01], 
which enable library-specific optimisations as part of library source code, in the form of 
compiler pragmas. Consequently, we can implement these transformations without alter- 
ing GHC's source code, which greatly simplifies experimentation with different transforma- 
tions. 

In the reminder of this section, we illustrate the transformation of flat data-parallel code 
into SPMD code by way of an example. Further details are in [KC99, CK03, CSL07, CLS07, 
CLP+07]. 

7.1 Running example 

As an example, we consider the computation of the value f ar_f orces, in the function 
calcForces of Section 2.2, by way of the array comprehension, 

[: forceOn p m 1 | p <- ps, isFar len 1 p :] 
After vectorisation and simplification to remove intermediate closure data structures, we 
have 

forceOn'L (filterPg (isFar len 1) ps) m 1 
The code performs two collective operations on the input array ps in sequence. Firstly, the 
application of f i It erPg to remove all particles that are not far, and secondly, a computation 
that corresponds to lifting f orceOn only on its first argument (here called, forceOn'L): 

forceOn'L PAParticle -> Mass -> Location -> PA Force 
Such pipelines of collective operations are typical for data-parallel code. 

7.2 The SPMD execution model 

The implementation of collective array operations, such as mapP and f ilterP, needs to 
distribute the workload evenly across the the available processing elements (PEs), such as 
multiple cores and CPUs. In the data-parallel model, the workload of a PE is dependent 
on the number of array elements residing on that PE. Hence, we balance work by suitably 
distributing the array elements. By default, we choose an even distribution; i.e., given p PEs 
and an array of length n, each PE gets about n/p array elements. 

In the SPMD model, the individual PEs process local array elements until they arrive at 
a point in the computation where they require non-local data, and need to cooperate with 
other PEs. In our example, the result of filterPg is such a point. Even if the input to 
f ilterPg is an array that is evenly spread across the PEs, the output of filterPg might 
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be wildly unbalanced, depending on which elements of the array are selected by the pred- 
icate. If so, any further processing of that array would have an equally unbalanced work 
distribution. 

To avoid a work imbalance, arrays need to be re-distributed when their size changes. 
Redistribution is a cooperative process in which all PEs need to coordinate. However, re- 
distribution is not the only such operation in an SPMD implementation of data parallelism. 
Other prominent cooperative operations are reductions (such as f oldP), pre-scans (such as 
s canlP), and permutation operations. Overall, a parallel program executing in SPMD-style 
alternates between processing phases, where the PEs operate independently on local data, and 
communication phases, where the processing elements interact and exchange data. 

Communication phases are typically expensive because they include data exchange 
and blocking to allow any slower PEs to catch up. Hence, compiler optimisations that 
remove communication phases in favour of longer-running processing phases are often 
worthwhile. In particular, the redistribution of arrays after operations that change the array 
length, such as f ilterPg, does not necessarily improve overall runtime. An inexpensive, 
purely local operation may be faster, even if work is not ideally balanced, than an expensive 
redistribution followed by the same local operation with a perfectly balanced workload. 

7.3 Gang parallelism 

Our implementation of the SPMD model for data parallelism is based on the coordinated ex- 
ecution of a gang of threads, with one thread per PE. GHC includes a Haskell library for con- 
current programming with explicit thread forking and thread communication primitives. It 
forms the lowest level of abstraction in our data-parallel array framework and enables us to 
implement the entire library in Haskell without any special compiler support or the need to 
resort to C code. 

We need to make the distributed nature of computations in the SPMD model explicit to 
further compile the code resulting from vectorisation, such as 

forceOn'L (filterPg (isFar len 1) ps) m 1 
In this context, distribution does not imply that the data is necessarily located on physically 
distinct memory banks, but that different threads are responsible for the processing of dif- 
ferent portions of parallel arrays. By being explicit about distribution, we are automatically 
also explicit about the distinction of processing phases versus communication phases. 

Our main vehicle for distinguishing between these two phases and making distribu- 
tion explicit is the type Dist a of distributed values. For instance, Dist I nt, pronounced 
"distributed Int", denotes a collection of local integers, such that there is one local integer 
value per gang thread. Arrays can be distributed, too: Dist [ : Float : ] is a collection of 
local array chunks, again one per gang member, which together make up the array. Arrays 
are distributed across gang members and joined back together by the following functions: 
splitD : : PA a -> Dist (PA a) 
joinD : : Dist (PA a) -> PA a 
Distributed values support a number of operations, most importantly mapping: 
mapD : : (a -> b) -> Dist a -> Dist b 
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While splitD and joinD denotes communication, mapD is the main means of implement- 
ing parallel processing phases: the gang members concurrently apply the (purely sequen- 
tial) function to their respective local values. 

7.4 Inlining of gang code 

Given a sequential filter function operating on a single chunk of a parallel array 

filters : : (a -> Bool) -> PA a -> PA b 
we can define filterPg as a distributed gang computation as follows: 

filterPg p arr = joinD (mapD (filters p) (splitD arr) ) 
Given a global parallel array, which is not distributed, splitD distributes the array across 
the gang, mapD (filters p) applies the sequential filter function in parallel to all chunks 
of the distributed array, and finally, joinD combines the various chunks, which may now 
be of varying length, into one global array. 

Similarly, forceOn'L internally consists of mapDs that compute the force for each par- 
ticle. The force computations for the individual particles are entirely independent, so we 
can assume forceOn' L to have the following structure: 
forceOn'L ps m 1 

= joinD (mapD (mapS (Ap. forceOn p m 1) ) (splitD ps) ) 
where forceOn is the original, sequential function from the source of our Barnes-Hut im- 
plementation and mapS is a purely sequential array mapping function. 

GHC's inliner will inline the definition of both filterPg and forceOn' l; i.e., it will 
perform the following rewriting: 

forceOn'L (filterPg (isFar len 1) ps) m 1 
^Mining 

joinD (mapD (mapS (Ap. forceOn p m 1) ) ( 

splitD (joinD (mapD (filters (isFar len 1)) (splitD ps) ) ) ) ) 
Of special interest here is the function splitD which is applied to the immediate result 
of joinD (in the second line of the resulting expression). This turns a distributed array into 
a global array and distributes it again. In contrast to the original array, the newly distributed 
one is guaranteed to be distributed evenly; hence, a splitD/ joinD combination performs 
load balancing. 

However, as we remarked earlier, it is often an advantage to accept some load imbal- 
ance in favour of avoiding communication phases in an SPMD computation. In GHC, we 
easily achieve that by specifying the following rewrite rule: 

"splitD/ joinD" forall xs . splitD (joinD xs) = xs 
GHC has support for specifying such rewrite rules directly in the library source code as 
compiler pragmas [PTH01]. Applications of the splitD/ joinD rule frequently produce 
two adjacent applications of mapD, which signal two adjacent purely sequential and thread- 
local computations. We can combine them, and hence eliminate a synchronisation point, 
using the well known map fusion law: 

"mapD/mapD" forall f g xs . 

mapD f (mapD g xs) = mapD (f . g) xs 
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Applying both rules to our example, we get 

joinD (mapD (mapS (Ap. forceOn p m 1) ) ( 

splitD (joinD (mapD (filters (isFar len 1)) (splitD ps) ) ) ) ) 

^ Apply splitD/ joinD 

joinD (mapD (mapS (Ap. forceOn p m 1) ) ( 
mapD (filters (isFar len 1)) (splitD ps))) 

^ Apply mapD/mapD 

joinD (mapD (mapS (Ap. forceOn p m 1) . filters (isFar len 1)) 
(splitD ps) ) ) 

At this point, the question arises whether we can combine adjacent sequential array com- 
binators, such as mapS and filters, to reduce the number of array traversals and inter- 
mediate data structures. Indeed, we aggressively remove such inefficiencies using a fusion 
framework known as stream fusion [CSL07, CLS07, CLP+07], but we will refrain from dis- 
cussing this in detail. 

This concludes our brief overview of the post-vectorisation aspects of Data Parallel 
Haskell. A somewhat more detailed discussion can be found in [CLP+07]. 

8 Related work 

We discussed prior work on the implementation of language support for nested data par- 
allelism in detail in [CLP+07]. In this paper, we will only give a brief overview of existing 
work, and how they compare to our approach. 

The starting point for our work was the nested data parallel programming model of 
NESL [Ble90, BCH+94], which we extended and implemented in the context of a general- 
purpose language and GHC, a state-of-the-art compiler. Consequently, we have to deal 
with a multitude of issues not previously addressed, as for example the combination of 
user-defined and parallel data structures, selective vectorisation, higher-order functions, 
separate compilation, and aggressive cross-function optimisation. 

Prins et al. worked on various aspects of the vectorisation of nested data parallel pro- 
grams; see, e.g., [PP93, PPW95]. Most of their work was also in the context of a functional 
language, but one that like NESL lacks many of Haskell's features. Their work is largely 
orthogonal to ours. 

The Proteus system [MNPR94] promised a combination of data and control parallelism, 
but Proteus had a particular focus on manual refinement of algorithms, where data paral- 
lel components were automatically vectorised, this again was a complete whole-program 
transformation. Moreover, the system was never fully implemented. 

Manticore [FFR+07] supports a range of forms of parallelism including nested data 
parallelism. Manticore employs some of the same techniques that we use, but does not 
implement flattening yet [FRRS08]. According to the project web page, a preliminary im- 
plementation of the Manticore system should be available around the time when this paper 
is published. 

So et al. [SGW06] developed a parallel library of immutable arrays for C/C++ support- 
ing what they call sub-primitive fusion. Their choice of immutable arrays, despite working 
with imperative languages, is to enable aggressive program transformations, much like in 
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our approach. However, where we apply transformations statically during compile time, 
their library builds a representation of the to be executed computation at runtime. Conse- 
quently, they require less compiler support and do not have to worry about inlining and 
similar optimisations. However, they incur a runtime penalty by performing optimisations 
at runtime and need to amortise that penalty by further optimisations. Like us, they also 
strive for a seamless integration of data parallelism and explicit concurrency within a single 
program. 



9 Conclusion 

We are excited about Data Parallel Haskell because it gives us some chance of writing par- 
allel programs that can in principle efficiently exploit very large parallel machines working 
on large data sets. 

In this paper we have outlined solutions to the challenges of polymorphism, higher 
order functions, and user-defined data types. There is much to do, however, before we 
can declare victory. The very generality of Data Parallel Haskell makes it an ambitious 
undertaking. Many components have to work together smoothly to generate efficient code 
— and that is before we start to consider matters such as using SSE vector instructions or 
GPUs, or mapping to a distributed memory architecture. Nevertheless, we regard nested 
data parallelism general, and Data Parallel Haskell in particular, as a very promising and 
exciting approach to harnessing the multicores. 
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